We report on the coding and performance of our polynomial hybrid Monte Carlo program on the Earth Simulator.
At present the entire program achieves 25-40% efficiency. An analysis of overheads shows that tuning of
inter-node communications is required for further improvement.



1. Introduction 

The joint CP-PACS/JLQCD Collaborations have been carrying out
three-flavor full QCD simulations with the Iwasaki RG gauge action
and the non-perturbatively O(a) improved Wilson quark action using
the polynomial Hybrid Monte Carlo (PHMC) method [2] on a variety of
computers. One of the computers is the Earth Simulator (ES) at the
ES Center. It was made available under the project "Study of the
Standard Model of Elementary Particles on the Lattice with the Earth
Simulator", which was approved as one of the "Epoch Making Simulation
Projects" of the ES Center. Here we report on the performance of our
PHMC program on the ES.

2. Coding on ES 

The ES consists of 640 processing nodes (PN) connected by a
one-dimensional crossbar switch with 12.3 GB/s bi-directional
bandwidth. Each PN is an SMP with 8 vector-type arithmetic
processors (AP), each with a peak speed of 8 GFlops. Among several
programming models, we employ micro-tasking by hand parallelization
for the 8 AP's of a single PN and MPI for communications between
PN's.

Our PHMC code was originally developed at KEK for the Hitachi
SR8000, where it sustains as much as 40% of peak speed as a whole.
On the ES, however, the performance turned out to be of order 10%.
To explore an effective coding style, we take the basic MULT
subroutine for Wilson-Dirac matrix-vector multiplication, and
measure the performance for seven types of coding on a single AP.
Results range from 10 to 73% of peak for a 6 x 6 x 48 x 48 lattice.

The highest performance is achieved by a code written originally for
a vector-parallel machine, the Fujitsu VPP500. In this code, sites
are one-dimensionalized on the z-t plane and divided by four to
realize a large vector length without list vectors. Other codes with
smaller vector lengths (e.g. that for the Hitachi SR8000, in which
only the t direction is vectorized) show low performance of at most
30%. Hence we rewrite the entire code with the coding style used for
the VPP500.

We subdivide the whole lattice on the x-y plane, assign each region
to a PN, and parallelize with MPI. Within one PN, the loop indices
in the x and y directions and an even/odd flag for the four vector
loops mentioned above are one-dimensionalized and divided by eight.
This enables the compiler to do an automatic parallelization
(micro-tasking) of all appropriate do-loops.

3. Performance of MULT 

Since performance depends on the lattice size and the number of PN,
we examine three cases: a) 20^3 x Nt on 5 x 2 PN (4 x 10 x 20 x Nt
per PN), b) 24^3 x Nt on 3 x 4 PN (8 x 6 x 24 x Nt per PN), and
c) 32^3 x Nt on 4 x 4 PN (8 x 8 x 32 x Nt per PN).

Figure 1. MULT vector performance on 1 AP for an 8 x 6 x 24 x Nt
lattice.
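
The per-PN sublattice sizes quoted for the three cases follow from dividing the x and y extents of the lattice by the dimensions of the PN array. The short Python sketch below reproduces that bookkeeping. The function name and the value Nt = 48 are illustrative choices only and are not taken from the PHMC code.

```python
def local_lattice(global_dims, pn_grid):
    """Return the sublattice (x, y, z, t) held by one PN.

    global_dims: (Nx, Ny, Nz, Nt) of the whole lattice.
    pn_grid:     (Px, Py), the number of PNs along x and y.
    Only the x-y plane is subdivided; z and t stay whole on every PN.
    """
    nx, ny, nz, nt = global_dims
    px, py = pn_grid
    if nx % px or ny % py:
        raise ValueError("PN grid must divide the x and y extents evenly")
    return (nx // px, ny // py, nz, nt)


NT = 48  # example temporal extent, chosen only for illustration

cases = {
    "a": ((20, 20, 20, NT), (5, 2)),
    "b": ((24, 24, 24, NT), (3, 4)),
    "c": ((32, 32, 32, NT), (4, 4)),
}

for label, (dims, grid) in cases.items():
    print(label, local_lattice(dims, grid))
```

Dividing the x-y sites of one PN by eight then gives the share of a single AP, e.g. 8 x 8 = 64 sites in the x-y plane become 8 per AP in case c).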

3.1. Vector Processing on single AP 

The vector processing performance on a single AP is of fundamental
importance. Our program includes redundant arithmetic calculations
at edges in the z direction to realize a long vector length.
Therefore we distinguish the performance reported by the system
analyzer "ftrace" from that calculated theoretically for the
effective part excluding the redundant operations. For the latter,
the total flops equals 1296 flops x #sites. The redundant part costs
2-4% of peak performance.
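
The effective performance defined above can be computed from a measured execution time as in the following minimal sketch. The count of 1296 flops per site and the 8 GFlops peak of one AP are taken from the text; the timing value in the example is hypothetical and serves only to show the arithmetic.

```python
FLOPS_PER_SITE = 1296      # effective MULT flops per lattice site (from the text)
PEAK_GFLOPS_PER_AP = 8.0   # peak speed of one AP on the ES


def effective_gflops(n_sites, seconds):
    """Sustained speed counting only the effective (non-redundant) flops."""
    return FLOPS_PER_SITE * n_sites / seconds / 1e9


def efficiency(gflops, peak=PEAK_GFLOPS_PER_AP):
    """Fraction of peak speed reached."""
    return gflops / peak


# Sublattice of case b) with Nt = 48 on a single AP.
n_sites = 8 * 6 * 24 * 48
hypothetical_seconds = 0.015  # assumed timing, NOT a measured value

speed = effective_gflops(n_sites, hypothetical_seconds)
print(f"{speed:.2f} GFlops, {100 * efficiency(speed):.1f}% of peak")
```

The number reported by ftrace would be higher than this effective value, since it also counts the redundant operations at the z edges.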

Figure 1 shows the single AP performance of MULT for case b)
plotted versus Nt. We test two codes. In the original one,
contributions from the 8 directions are calculated in one large
do-loop and are summed up later. In the revised code, which is
intended to allow arithmetic operations and communications to be
overlapped in the future, the large loop is divided into two loops
for the z,t and x,y directions. The array structure is also
different. The revised code runs about 15% slower. We suppose that
this is partly caused by a slow startup of do-loops.

In general, the vector performance of the re- 
vised MULT code reaches 55-65% for all three 
cases. However, it drops by about 10% when the 
vector length just exceeds a multiple of the size 
of vector registers, 256. 
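
A drop just above a multiple of 256 is what one would expect if a loop of length n is processed in register-sized chunks, so that the last chunk is nearly empty. The sketch below counts the chunks and the average register fill under that simple assumption. It is a simplified model for illustration only and is not meant to reproduce the measured 10% drop.

```python
import math

VECTOR_REGISTER_LENGTH = 256  # elements per vector register (from the text)


def register_chunks(vector_length, reg_len=VECTOR_REGISTER_LENGTH):
    """Number of register-sized chunks needed to process one vector loop."""
    return math.ceil(vector_length / reg_len)


def register_fill(vector_length, reg_len=VECTOR_REGISTER_LENGTH):
    """Average fraction of each register that holds useful elements."""
    return vector_length / (register_chunks(vector_length, reg_len) * reg_len)


for n in (256, 257, 512, 513):
    print(n, register_chunks(n), round(register_fill(n), 3))
```

In this model a loop of length 257 needs two chunks, the second of which carries a single element, so the fixed startup cost of the extra chunk is paid for almost no useful work.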

Figure 2. MULT efficiency on 1 PN versus #AP.

3.2. Micro-Tasking Parallelization

The cost of automatic parallelization is another important point,
because an SMP of vector processors is a distinctive feature of the
PN. Figure 2 shows the efficiency against #AP for 8 x 8 x 32 x 62
and 8 x 6 x 24 x 48 lattices. The micro-tasking parallelization
costs 3 to 4%, which is not so high, while the memory copy to
implement boundary conditions is relatively heavy, being 4% for
1 AP and 7% for 8 AP's.

3.3. MPI Communications 

In our code one PN issues an MPI_Irsend for gathered data, and an
adjacent PN issues an MPI_Irecv and then scatters the received data.
This enables us to construct long messages. The message size ranges
from 0.34 MB to 1.64 MB, and the throughput ranges from 1.62 GB/sec
to 5.56 GB/sec. These numbers are consistent with an MPI performance
report from the ES Center.

Communication performance drops by about 20% due to the buffer copy
for gather/scatter; e.g., the throughput for the longest message
drops to 4.35 GB/sec.
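
The quoted drop for the longest message can be checked with a few lines of arithmetic. The sketch below assumes that the 1.64 MB message is the one reaching 5.56 GB/sec before the buffer copy is included, and that 1 MB = 10^6 bytes and 1 GB = 10^9 bytes; neither assumption is stated explicitly in the text.

```python
def relative_drop(before, after):
    """Fractional loss of throughput when going from `before` to `after`."""
    return 1.0 - after / before


def transfer_time_ms(size_mb, throughput_gb_per_s):
    """Time in milliseconds to move size_mb at the given throughput.

    Assumes 1 MB = 1e6 bytes and 1 GB = 1e9 bytes.
    """
    return size_mb * 1e6 / (throughput_gb_per_s * 1e9) * 1e3


raw = 5.56        # GB/sec, longest message without gather/scatter copy
with_copy = 4.35  # GB/sec, including the buffer copy
size = 1.64       # MB, longest message

print(f"drop: {100 * relative_drop(raw, with_copy):.1f}%")
print(f"time without copy: {transfer_time_ms(size, raw):.3f} ms")
print(f"time with copy:    {transfer_time_ms(size, with_copy):.3f} ms")
```

The difference between the two times is a rough estimate of the gather/scatter copy cost per message under these assumptions.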

3.4. Breakdown of Overheads 

In order to show how various overheads affect the overall
efficiency, we show in Fig. 3 the MULT performance starting from
1 AP up to 72 AP's (9 PN's). The lattice volume per AP is fixed to
4 x 2 x 32 x 62, which corresponds to a 32 x 32 x 32 x 62 lattice on
a 4 x 4 PN array. The performance of 61.6% for 1 AP finally drops to
35.4% for 72 AP's. The main cause of the drop is the slow speed of
MPI communications, which are not overlapped with arithmetic
operations, and secondly the cost of memory copy within one PN;
together these cost 40% of the total execution time. The fraction
becomes higher when the volume per node becomes smaller; for a
20^3 x 40 lattice on a 5 x 2 PN array, 60% of execution time is
spent on communications and memory copy.

Figure 3. Efficiency of MULT versus #AP.

Figure 4. MULT sustained speed for a 24^3 x 48 lattice versus #AP.
[Plot: sustained speed versus #AP (axis ticks 8 to 80). Measured
points, labeled by PN array (2x1, 3x1, 2x2, 3x2, 3x3), are compared
with the values expected if the parallelization efficiency is 100%.]
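
The overhead figures quoted in Section 3.4 can be cross-checked with a very simple model: if a fraction f of the execution time is spent in communications and memory copy that are not overlapped with arithmetic, the single-AP efficiency is scaled by (1 - f). The sketch below is only a rough consistency check under that assumption and ignores all other sources of loss.

```python
def efficiency_with_overhead(base_eff, overhead_fraction):
    """Efficiency left when overhead_fraction of the run time does no arithmetic."""
    return base_eff * (1.0 - overhead_fraction)


def implied_overhead_fraction(base_eff, observed_eff):
    """Overhead fraction that would explain a drop from base_eff to observed_eff."""
    return 1.0 - observed_eff / base_eff


single_ap = 61.6   # % of peak on 1 AP (from the text)
observed_72 = 35.4  # % of peak on 72 AP's (from the text)

print(f"model with 40% overhead: {efficiency_with_overhead(single_ap, 0.40):.1f}%")
print(f"implied overhead: {100 * implied_overhead_fraction(single_ap, observed_72):.1f}%")
```

With the quoted 40% overhead the model gives an efficiency close to the measured value for 72 AP's, which is consistent with communications and memory copy being the dominant losses.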

Figure 4 shows the sustained speed in GFlops versus #AP for a
24^3 x 48 lattice. In this case, the lattice size is fixed for all
measurements. For 1 AP, 4.12 GFlops (52% efficiency) is achieved,
and the 164.83 GFlops for 72 AP's is about 40 times that for 1 AP.
In other words, the parallelization efficiency is 40/72 ≈ 56%.
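
The parallelization efficiency quoted above is the speedup relative to one AP divided by the number of AP's. A minimal sketch of that calculation, using the two sustained speeds given in the text (the function names are illustrative):

```python
def speedup(total_gflops, single_ap_gflops):
    """How many times faster the parallel run is than the 1-AP run."""
    return total_gflops / single_ap_gflops


def parallel_efficiency(total_gflops, single_ap_gflops, n_ap):
    """Speedup divided by the number of AP's used."""
    return speedup(total_gflops, single_ap_gflops) / n_ap


single = 4.12    # GFlops on 1 AP
total = 164.83   # GFlops on 72 AP's
n_ap = 72

print(f"speedup: {speedup(total, single):.1f}")
print(f"parallelization efficiency: {100 * parallel_efficiency(total, single, n_ap):.0f}%")
```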



Table 1

Profile (%), performance (GFlops) per AP and total efficiency (%)
of the PHMC program.

size                20^3 x 40   24^3 x 48   32^3 x 64   32^3 x 62
node                5 x 2       3 x 4       4 x 4       4 x 4
copy (%)            36.9        28.6        21.4        24.6
MULT (%)            24.8        29.4        34.8        34.1
MLCI (%)            13.4        18.3        20.1        17.8
BiCG (%)            9.8         9.2         8.8         8.9
Perf (GFlops)       2.01        2.29        2.87        3.18
ES efficiency (%)   24.9        28.6        35.9        39.8



4. Performance of the PHMC Program 

Table 1 shows the profile of the entire PHMC program as provided by
the ES profiler. The arithmetic calculations and boundary
copy/communications in MULT are the two heaviest routines. The
multiplication of the inverse clover term (MLCI) and BiCGStab (BiCG)
are the third and fourth heaviest, and are relatively light.
Therefore, for the next round of simulations on a 32^3 x Nt lattice,
we plan to overlap arithmetic operations and communications in the
MULT routine.

As Table 1 shows, the PHMC program runs on the ES with an efficiency
of 25-40% for our four target lattice sizes. Currently we are
executing on a 20^3 x 40 lattice at m_π/m_ρ ≈ 0.6. The efficiency of
31% on the ES is comparable to that on other machines: 35% on
SR8000/F1 at KEK with 32 nodes, 44% on VPP5000 at Tsukuba with
8 nodes, and 20% on CP-PACS with 500 nodes.